.. _`Text Count Vectorizer`:

.. _`org.sysess.sympathy.machinelearning.count_vectorizer`:

Text Count Vectorizer
=====================

.. image:: count_vectorizer.svg
   :width: 48

Convert a collection of text documents to a matrix of token counts

**Documentation**

Convert a collection of text documents to a matrix of token counts

*Configuration*:

- *encoding*
    If bytes or files are given to analyze, this encoding is used to
    decode.

- *decode_error*
    Instruction on what to do if a byte sequence is given to analyze
    that contains characters not of the given ``encoding``. By default,
    it is 'strict', meaning that a UnicodeDecodeError will be raised.
    Other values are 'ignore' and 'replace'.

- *strip_accents*
    Remove accents and perform other character normalization during the
    preprocessing step. 'ascii' is a fast method that only works on
    characters that have a direct ASCII mapping. 'unicode' is a
    slightly slower method that works on any characters. None (default)
    does nothing. Both 'ascii' and 'unicode' use NFKD normalization
    from :func:`unicodedata.normalize`.

- *lowercase*
    Convert all characters to lowercase before tokenizing.

- *analyzer*
    Whether the feature should be made of word n-grams or character
    n-grams. Option 'char_wb' creates character n-grams only from text
    inside word boundaries; n-grams at the edges of words are padded
    with space. If a callable is passed, it is used to extract the
    sequence of features out of the raw, unprocessed input.

    .. versionchanged:: 0.21
       Since v0.21, if ``input`` is ``filename`` or ``file``, the data
       is first read from the file and then passed to the given
       callable analyzer.

- *ngram_range*
    The lower and upper boundary of the range of n-values for different
    word n-grams or char n-grams to be extracted. All values of n such
    that min_n <= n <= max_n will be used. For example, an
    ``ngram_range`` of ``(1, 1)`` means only unigrams, ``(1, 2)`` means
    unigrams and bigrams, and ``(2, 2)`` means only bigrams. Only
    applies if ``analyzer`` is not callable.

- *stop_words*
    If 'english', a built-in stop word list for English is used. There
    are several known issues with 'english' and you should consider an
    alternative (see stop_words). If a list, that list is assumed to
    contain stop words, all of which will be removed from the resulting
    tokens. Only applies if ``analyzer == 'word'``. If None, no stop
    words will be used. max_df can be set to a value in the range
    [0.7, 1.0) to automatically detect and filter stop words based on
    intra-corpus document frequency of terms.

- *max_df*
    When building the vocabulary, ignore terms that have a document
    frequency strictly higher than the given threshold (corpus-specific
    stop words). If float, the parameter represents a proportion of
    documents; if integer, absolute counts. This parameter is ignored
    if vocabulary is not None.

- *min_df*
    When building the vocabulary, ignore terms that have a document
    frequency strictly lower than the given threshold. This value is
    also called cut-off in the literature. If float, the parameter
    represents a proportion of documents; if integer, absolute counts.
    This parameter is ignored if vocabulary is not None.

- *max_features*
    If not None, build a vocabulary that only considers the top
    max_features terms ordered by term frequency across the corpus.
    This parameter is ignored if vocabulary is not None.

- *binary*
    If True, all non-zero counts are set to 1. This is useful for
    discrete probabilistic models that model binary events rather than
    integer counts.
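The options above mirror the parameters of scikit-learn's
:class:`sklearn.feature_extraction.text.CountVectorizer`. As a rough
illustration of how a few of them interact, here is a minimal
standalone sketch in plain scikit-learn rather than the Sympathy node
API (the corpus is invented, and ``get_feature_names_out`` assumes
scikit-learn >= 1.0):

.. code-block:: python

    from sklearn.feature_extraction.text import CountVectorizer

    # Toy corpus, purely for illustration.
    corpus = [
        "The cat sat on the mat.",
        "The dog sat on the log.",
        "Cats and dogs.",
    ]

    vectorizer = CountVectorizer(
        lowercase=True,        # normalize case before tokenizing
        analyzer="word",       # word n-grams (the default)
        ngram_range=(1, 2),    # extract unigrams and bigrams
        stop_words="english",  # drop common English words such as "the"
    )

    # fit_transform builds the vocabulary and returns a sparse
    # document-term count matrix with one row per document.
    counts = vectorizer.fit_transform(corpus)
    print(counts.shape)
    print(vectorizer.get_feature_names_out())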
*Attributes*:

- *vocabulary_*
    A mapping of terms to feature indices.

- *stop_words_*
    Terms that were ignored because they either:

    - occurred in too many documents (`max_df`)
    - occurred in too few documents (`min_df`)
    - were cut off by feature selection (`max_features`)

    This is only available if no vocabulary was given. Both fitted
    attributes are demonstrated in the sketch at the end of this
    section.

*Input ports*:

*Output ports*:
    **model** : model
        Model

**Definition**

*Input ports*

*Output ports*

:model: model
    Model

.. automodule:: node_text

.. class:: CountVectorizer
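For reference, the fitted attributes described above behave like their
scikit-learn counterparts. A minimal sketch, again in plain
scikit-learn with an invented corpus, showing ``vocabulary_``,
``stop_words_`` and the effect of ``binary``:

.. code-block:: python

    from sklearn.feature_extraction.text import CountVectorizer

    # "red" occurs in every document, so with max_df=0.9 it is treated
    # as a corpus-specific stop word and dropped from the vocabulary.
    corpus = ["red blue", "red green", "red blue green"]

    vectorizer = CountVectorizer(max_df=0.9, binary=True)
    vectorizer.fit(corpus)

    print(vectorizer.vocabulary_)   # {'blue': 0, 'green': 1}
    print(vectorizer.stop_words_)   # {'red'}

    # binary=True clips all non-zero counts to 1.
    print(vectorizer.transform(corpus).toarray())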